[57] ABSTRACT

Multiplier units of the modified Booth decoder and 
carry-save adder/full adder combination are used to 
implement a pipeline active filter wherein pixel data is 
processed sequentially, and each pixel need only be 
accessed once and multiplied by a predetermined num- 
ber of weights simultaneously, one multiplier unit for 
each weight. Each multiplier unit uses only one row of 
carry-save adders, and the results are shifted to less 
significant multiplier positions and one row of full ad- 
ders to add the carry to the sum in order to provide the 
correct binary number for the product Wp. The full 
adder is also used to add this product Wp to the sum of 
products ΣWp from preceding multiply units. If m × m
multiplier units are pipelined, the system would be capable
of processing a kernel array of m × m weighting
factors.


8 Claims, 14 Drawing Figures

PIPELINE ACTIVE FILTER UTILIZING A BOOTH
TYPE MULTIPLIER 

ORIGIN OF INVENTION

The invention described herein was made in the per- 
formance of work under a NASA contract and is sub- 
ject to the provisions of Section 305 of the National 
Aeronautics and Space Act of 1958, Public Law 85-568 
(72 Stat. 435; 42 USC 2457).

BACKGROUND OF THE INVENTION 

This invention relates to a pipeline active filter of the 
type commonly used as a convolver or correlator for 
image enhancement, data filtering, correlation, pattern 
extraction, Synthetic Aperture Radar (SAR) data pro- 
cessing, and the like. 

These applications for an active filter can all use the 
same general group of convolution operations, namely
summation of the weighted values of an input data 
stream of picture elements (pixels) representing (usu- 
ally) a two-dimensional image. Weighting is accom- 
plished by multiplying each pixel value by a set of 35 by 
35 weighting factors, for example, to create a new output
value for each pixel.

In conventional processing, using a standard digital 
computer, or an array processor, the data are processed 
sequentially, using a repetition of multiplication and 
summation operations on each pixel value. Thus, on an
image of 1000 by 1000 pixels, filtered by a 35 by 35
weight mask (a realistic requirement), the data must be
accessed and multiplied 1000 × 1000 × 35 × 35 times to
produce one full image. This amount of processing is
obviously very slow and expensive, and thus greatly
limits application of the convolution processing. When 
compared to the speed of acquisition of the data, even 
from a spacecraft transmitting slow-scan television 
frames, the disparity is seen to be great. 

One solution is to provide more than one multiplier,
and to process the data in a pipeline fashion, thereby
arranging to hold the input data stream access requirements
to a minimum. If the process were embodied in
dedicated VLSI hardware, rather than in software
(computer program), this solution could be more
readily accomplished, and produced in quantities at a
reasonable cost. In processing a 1000 by 1000 pixel
image by a kernel of 35 by 35 weights, each pixel need
only be accessed once and multiplied by all weights
simultaneously, with the result that the entire image
processing operation requires only 1000 by 1000 successive
accesses, a saving of 1225 to one. (For two-dimensional
convolution the incoming pixels are delayed for
one image line length between each row of the kernel.)
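
The access counts quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch in Python; the variable names are chosen here and do not come from the source.

```python
# Access counts for the example in the text: a 1000 by 1000 pixel image
# filtered by a 35 by 35 weight mask. Variable names are illustrative only.
image_pixels = 1000 * 1000
kernel_weights = 35 * 35

# Conventional processing: every pixel is accessed once per weight.
sequential_accesses = image_pixels * kernel_weights

# Pipelined processing: every pixel is accessed once and meets all weights at once.
pipelined_accesses = image_pixels

saving = sequential_accesses // pipelined_accesses
print(sequential_accesses, pipelined_accesses, saving)
```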

One approach suggested by Professor Carver Mead
of California Institute of Technology to the present
invention is to use the modular algorithm described by
Danny Cohen, “Mathematical Approach to Computational
Networks,” Information Sciences Institute,
U.S.C. ISI/RR-78-73 ARPA Order No. 2223, November
1978. That algorithm is diagrammed in FIG. 1. The
pixel data (typically 8 bits per pixel) is input at X for
multiplication by weights W1, W2, W3 and W4. Each
section adds the new product (temporary product) with
the output sum, S, of the previous section, indicated by
a plus sign in a circle. The product sum from the previous
section is passed through a unit time delay Z. Note
that no delays are needed in the “upper” line.
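
The section-by-section behavior described for FIG. 1 can be sketched in software. The Python below is a minimal model written for illustration, not the circuit itself: the function names `pipeline_fir` and `direct_fir` are hypothetical, and the ordering of the weights along the pipeline is an assumption, since the figure is not reproduced here. Every section sees the same pixel at the same time (no delay in the "upper" line) and adds its product to the previous section's sum, which arrives one time unit late.

```python
def pipeline_fir(pixels, weights):
    """Model of the sectioned filter: each section adds weight * pixel to the
    previous section's sum, which has passed through one unit delay."""
    n = len(weights)
    delayed = [0] * n          # delayed[k]: sum from section k-1, one time unit old
    outputs = []
    for x in pixels:           # the same pixel x is presented to every section
        sums = [weights[k] * x + delayed[k] for k in range(n)]
        delayed = [0] + sums[:-1]   # each sum moves on to the next section
        outputs.append(sums[-1])    # the last section delivers the filter output
    return outputs


def direct_fir(pixels, weights):
    """Reference: the same weighted sum computed by re-reading past pixels."""
    n = len(weights)
    outputs = []
    for t in range(len(pixels)):
        acc = 0
        for j in range(n):
            if t - j >= 0:
                acc += weights[n - 1 - j] * pixels[t - j]
        outputs.append(acc)
    return outputs


pixels = [3, 1, 4, 1, 5, 9, 2, 6]
weights = [1, 2, 3, 4]
print(pipeline_fir(pixels, weights) == direct_fir(pixels, weights))
```

The pipelined form reads each pixel exactly once, whereas the reference form re-reads each pixel once per weight, which is the access saving the text describes.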

In a digital processor, multiplication is carried out by
a succession of additions, and when carried out in a 
digital computer using binary arithmetic, carry opera- 
tions usually take up most of the operating time. Conse- 
quently much effort has gone into the design of addi- 
tion/carry algorithms and circuits to reduce the carry 
time in digital processors. The need to propagate the 
carry can be made to occur less frequently than the 
remaining internal addition operations if some additions 
are carried out with carry-save operations so that only 
one carry-propagate operation occurs per several addi- 
tion operations. The digital processor will then be more 
efficient. 

A suitable logic design for the multipliers is shown in 
FIGS. 2a and 2c which can be implemented in VLSI 
chips as described by Rodney T. Masumoto in a thesis 
for an Electrical Engineer degree at California Institute 
of Technology, May 18, 1978. The logic design imple- 
ments the special case of ternary multiplication often 
referred to as a modified Booth algorithm summarized 
in the following truth table, wherein the columns 
headed Y_{i+1}, Y_i and Y_{i-1} represent three successive bits
of a multiplier, and the respective notations 1X and 2X
mean one times and two times a multiplicand.


BOOTH ALGORITHM TRUTH TABLE

Y_{i+1}  Y_i  Y_{i-1}   ADD  SUB  1X  2X   Operation
   0      0      0       1    0    0   0   ADD ZERO
   0      0      1       1    0    1   0   ADD 1X
   0      1      0       1    0    1   0   ADD 1X
   0      1      1       1    0    0   1   ADD 2X
   1      0      0       0    1    0   1   SUB 2X
   1      0      1       0    1    1   0   SUB 1X
   1      1      0       0    1    1   0   SUB 1X
   1      1      1       0    1    0   0   SUB ZERO


The column on the right is an interpretation of the 
operation to be executed in view of the outputs in the 
four columns headed ADD, SUB, 1X and 2X.

A constant shift of two bits of the multiplier Y occurs
between examinations of the multiplier bit sets Y_{i+1}, Y_i,
Y_{i-1}. After each shift, the logic looks at the present two
multiplier bits Y_i and Y_{i+1} and the previous bit Y_{i-1}. (In
conventional multipliers, the multiplier bits are examined
one at a time.) The multiplication action controlled
by the logic diagram of FIG. 2a through the logic diagram
of FIG. 2c allows merely shifting or not shifting
under 2X or 1X control, and inverting or not inverting
under ADD or SUB control, the multiplicand bit to be
added in a carry-save adder shown in FIG. 3a before
examining the next bits of the multiplier (pixel).

The logic circuit for decoding the set of three multiplier
(pixel) bits shown in FIG. 2a may be implemented
with FET NOR gates as shown in FIG. 2b. The logic
requires an exclusive-OR gate 1 to form the command
1X = Y_i ⊕ Y_{i-1}, add or subtract the multiplicand, depending
on whether Y_{i+1} is a bit 0 or a bit 1, and an
exclusive-OR gate 2 followed by an AND function gate
3 to form the command 2X = NOT(Y_i ⊕ Y_{i-1})·(Y_{i+1} ⊕ Y_i), add
or subtract twice the multiplicand, depending on
whether Y_{i+1} is a bit 0 or a bit 1. The logic symbols
employed are conventional, with a small circle at the
output signifying an inverting logic element.
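
The decoding rules can be written as a small function and checked row by row against the truth table. The Python below is a sketch for illustration; the function and variable names are hypothetical. It assumes that the 2X command uses the complement of the first exclusive-OR, which is what the truth table requires (2X is true only for bit patterns 011 and 100), and that ADD and SUB follow Y_{i+1} directly.

```python
def booth_decode(y_next, y_cur, y_prev):
    """Decode multiplier bits Y_{i+1}, Y_i, Y_{i-1} into (ADD, SUB, 1X, 2X)."""
    one_x = y_cur ^ y_prev                    # 1X = Y_i xor Y_{i-1}
    two_x = (1 - one_x) & (y_next ^ y_cur)    # 2X = not(1X) and (Y_{i+1} xor Y_i)
    sub = y_next                              # subtract when Y_{i+1} is 1
    add = 1 - y_next                          # add when Y_{i+1} is 0
    return add, sub, one_x, two_x


# Rows of the truth table: (Y_{i+1}, Y_i, Y_{i-1}) -> (ADD, SUB, 1X, 2X)
TRUTH_TABLE = {
    (0, 0, 0): (1, 0, 0, 0),  # ADD ZERO
    (0, 0, 1): (1, 0, 1, 0),  # ADD 1X
    (0, 1, 0): (1, 0, 1, 0),  # ADD 1X
    (0, 1, 1): (1, 0, 0, 1),  # ADD 2X
    (1, 0, 0): (0, 1, 0, 1),  # SUB 2X
    (1, 0, 1): (0, 1, 1, 0),  # SUB 1X
    (1, 1, 0): (0, 1, 1, 0),  # SUB 1X
    (1, 1, 1): (0, 1, 0, 0),  # SUB ZERO
}

matches = all(booth_decode(*bits) == row for bits, row in TRUTH_TABLE.items())
print(matches)
```

Read as a signed digit, each row equals Y_{i-1} + Y_i − 2·Y_{i+1}, so every pass contributes −2, −1, 0, +1 or +2 times the multiplicand.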

It should be noted that if a pixel is to be multiplied by
a set of weights simultaneously by an array of multipliers
using the modified Booth decoder, only one decoder
is required, but a separate shifter/inverter circuit is
required for each weight. Such a circuit defined by the
logic diagram in FIG. 2c can be implemented with FET
devices as shown in FIG. 2d.

The advantage of the circuits in FIGS. 2b and 2d is
that they can be implemented with n-MOS integrated
circuit techniques for a very large scale integrated
(VLSI) circuit, together with the carry-save adder of
FIG. 3, but the main advantage of this modified Booth
decoder and shifter/inverter for multiplication, as used
in the present invention, is that it substantially reduces
the number of addition operations, and the time required
for those operations. That is because carries are
saved until a final product sum is to be formed in a full
adder, at which time carry propagation is allowed while
the next pixel is being multiplied by the same weight.
FIG. 4 illustrates a full adder which can be implemented
with n-MOS integrated circuit techniques as
described by Masumoto, supra.

An entire array of multiplier/adder circuits tends to
defeat the desire for providing many multipliers in a
small area, but if each of the multipliers uses only one
row of adders in the add-shift manner of the modified
Booth algorithm, much space is saved, and the operation
will still be much faster than a conventional computer
process. This speed is made possible by a carry-save
adder at each multiplier position implemented as
shown in FIG. 3 and described by Masumoto, supra.
The following is the truth table of the carry-save adder.

While the truth table of the carry-save (half) adder is
identical to that of a full adder, there are subtle circuit
and operational differences. A full adder propagates 
carries; a carry-save adder defers the carry propagation 
to the next adder cycle. Thus, there is no carry assimila- 
tion delay. 

FIG. 3b illustrates a conventional logic diagram of an
adder that uses two exclusive-NOR gates. FIG. 4 illustrates
an equivalent logic diagram using only NOR
gates which, although it requires more gates, can be
implemented more easily with n-MOS integrated circuit
techniques. Both of these may be used to implement a
carry-save adder or a full adder. The difference is only
in how they are used. In a carry-save adder, the carry
and sum outputs C_0 and S_0 are both saved in storage
devices, and the carry is not propagated. Instead, both
the carry and sum are added to a new bit, indicated as X
in FIG. 3b, during a following bit multiplication cycle.

In a full adder shown in FIG. 4, there is only one output,
S_0. The carry, C_0, is not an output of the adder;
instead it is an internal signal that is propagated as C_i to
a stage of higher order. The carry from the next lower
adder is shown as C_i in FIG. 4. The other two inputs, S
and C, are stored sum and carry bits from a carry-save
adder.
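
The difference in use between the two adders can be illustrated in software. The Python below is a generic sketch of carry-save accumulation followed by a single carry-propagating addition; it is not a model of the circuits of FIGS. 3 and 4. The names are hypothetical, and for simplicity the saved carry is held one bit position up rather than being routed to less significant stages as in the multiplier described later.

```python
def adder_cell(a, b, c):
    """One-bit adder cell; the same truth table serves both kinds of adder."""
    total = a + b + c
    return total & 1, total >> 1        # (sum bit, carry bit)


def carry_save_step(x, s, c, width):
    """Add word x to a stored (sum, carry) pair without propagating carries."""
    new_s = new_c = 0
    for i in range(width):
        s_bit, c_bit = adder_cell((x >> i) & 1, (s >> i) & 1, (c >> i) & 1)
        new_s |= s_bit << i
        new_c |= c_bit << (i + 1)       # carry is saved for the next cycle
    return new_s, new_c


def carry_propagate_add(s, c, width):
    """Final row of full adders: the carry ripples from stage to stage."""
    result, carry = 0, 0
    for i in range(width):
        bit, carry = adder_cell((s >> i) & 1, (c >> i) & 1, carry)
        result |= bit << i
    return result


addends = [13, 200, 77, 5]
s = c = 0
for x in addends:                       # several additions, no carry propagation
    s, c = carry_save_step(x, s, c, width=16)
total = carry_propagate_add(s, c, width=16)   # one carry-propagate operation
print(total)
```

Each carry-save step costs one cell delay regardless of word length; only the final addition has to wait for the carry to ripple through all stages.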

Both the carry-save adder and the full adder thus
implemented in VLSI chips are used by Masumoto,
supra, differently. In a 16 by 16 bit multiplier accumulator
using a modified Booth decoder to sum successive
products, Masumoto uses eight cascaded carry-save
adders and one final adder with carry propagation, a
full adder. The present invention utilizes only one carry-save
adder for forming a product by successive additions
of the multiplicand without carry propagation,
and a full adder to assimilate the separately stored carry
with the sum for a binary number sometimes referred to
hereinafter as a “temporary product.” The full adder is
then used a second time to add this temporary product
to a sum of products, thus allowing successive products
to be summed in pipeline multiplying units while processing
a stream of data, such as pixels in successive
lines of a stored frame of pixels.

The carry-save adder as used in this invention accepts
three signals: the local binary product X, a carry signal
C_{i+1} from the next more significant adder, and a sum
signal S_{i+2} from the second next more significant digit
carry-save adder, where the carry and sum
have been calculated during the previous clock cycle so
that in the logic diagram of FIG. 3a the inputs C_i and S_i
for the outputs S_0 and C_0 are actually C_{i+1} and S_{i+2}. In
contrast, a full adder immediately adds the carry that is
produced to the sum of its next more significant neighbor
to produce the correct number.

Since the correct product must ultimately be produced,
both carry-save adders and full adders are required
by the present invention, but as just noted above,
only one carry-save adder and one full adder are needed
for each product binary digit, so a unit to multiply an
8-bit pixel with a 16-bit weight requires an array of 25
multipliers (24 bits + carry), or for a truncated product,
a lesser number, such as 22, each multiplier being comprised
of a common modified Booth decoder, shifter/inverter,
carry-save adder and full adder. These multiplier
units use a triple bit examination approach to reduce
multiply/add operations (and circuits) to half of
the conventional equivalent. Together with the reduced
ratio of carry operations to sum operations, this operates
to make the process quite fast in this active filter.

SUMMARY OF THE INVENTION 

In accordance with the present invention, multiply 
units of the modified Booth decoder and carry-save 
adder/full adder combination are used to implement a 
pipeline active filter wherein data elements are pro- 
cessed sequentially, and each element need only be 
accessed once to be multiplied by a number of weights 
simultaneously, one multiply unit for each weight. Each 
multiply unit uses a modified Booth decoder and only 
one row of carry-save adders, and the results are trans- 
ferred to less significant multiplier positions for addition 
in subsequent operations for multiplication of bits of the 
element by weight. Each carry-save adder thus accepts 
a sum signal and a carry signal from more significant bit 
multiplier positions without having to add the carry 
signal it receives to obtain the correct sum. Each multi- 
plier uses one row of full adders to add the carry to the 
sum in order to provide the correct binary number, Wp, 
for the product, and to also add the product to a sum of 
products ΣWp from preceding pipelined multiply units.
If m × m multiplier units are pipelined, the system
would be capable of processing a kernel array of m × m
weighting factors.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 illustrates a prior-art finite impulse response
filter for multiplying pixels in sequence by a set of
weights W1 through W4.

FIG. 2a is a logic diagram of a prior-art modified
Booth algorithm useful in implementing a multiplier for
the present invention.

FIG. 2b is an FET circuit diagram for VLSI implementation
of the logic diagram of FIG. 2a.

FIG. 2c is a logic diagram for a shifter/inverter useful
with a modified Booth decoder for implementing a
multiplier for the present invention.

FIG. 2d is an FET circuit diagram for VLSI implementation
of the logic diagram of FIG. 2c.

FIG. 3a is a VLSI circuit diagram of a prior-art carry-save
adder useful in implementing the present invention,
and FIG. 3b is a logic diagram of that adder.

FIG. 4 is a logic diagram of a prior-art full adder
useful in implementing the present invention.

FIG. 5 illustrates the organization of an array of multipliers
comprised of a common modified Booth decoder
and a row of shifter/inverters (S/I), one S/I for
each multiplier with one of a row of carry-save adders
and one of a row of full adders for successive multiplication
of pixels by a weight in accordance with the
present invention.

FIG. 6 is a timing diagram for the operation of the
array of multipliers of FIG. 5.

FIG. 7 illustrates schematically the manner in which
five arrays of multipliers are organized for five weights
on one VLSI chip.

FIG. 8 illustrates schematically the manner in which
the chip of FIG. 7 is combined with six more chips for
simultaneously multiplying pixels by 35 weights in a
one-dimensional image processor (filter).

FIG. 9 illustrates the manner in which a plurality of
one-dimensional image processing filters as shown in
FIG. 8 are combined to form a 35 by 35 dimensional
image processor (filter).

FIG. 10 illustrates the manner in which a 35 by 35
filter shown in FIG. 9 may be used in a system to process
pixel data.

DESCRIPTION OF PREFERRED EMBODIMENTS

A multiply unit for a pipeline active filter will now be
described with reference to FIG. 5 and timing diagram
shown in FIG. 6. Subsequent figures will show that for
a specific embodiment, it is contemplated that 8-bit
pixels will be multiplied by a kernel of 35 by 35 weights.
This requires one multiply unit for each weight. Assuming
a 16-bit weight, the product will consist of 24 bits
plus a carry. However, where the accuracy of the less
significant bits is not required, the product may be
truncated, such as to 22 bits. The multiply unit must
then have one bit multiplier for each bit of the product.

Shown in FIG. 5 are just three multipliers of a multiply
unit. The center one is for the i-th bit, and adjacent
ones for the less significant bit, i−1, and the more significant
bit, i+1. Each multiplier utilizes a common modified
Booth decoder (FIGS. 2a, 2b), and is comprised of a
shifter/inverter, carry-save adder and full adder, all of
which have been described by Masumoto, supra, incorporated
herein by reference, but it should be understood
that these prior-art components are being used herein
only as examples of conventional components, instead
of others that could be used, to illustrate the best mode
contemplated for practicing the invention, which is
with VLSI technology like that contemplated by
Masumoto for his 16×16 multiplier/accumulator.

The multiply unit (multiplier/accumulator) of
Masumoto differs in organization from the present invention.
It requires eight rows of carry-save adders and
one row of full adders, as noted hereinbefore. Thus,
while specific prior-art circuits have been disclosed by
Masumoto for a preferred VLSI implementation of this
invention, it should be understood that this invention
resides in the organization of a modified Booth decoder
and a single row of each of (1) shifters/inverters (S/I)
responsive to the decoder, (2) carry-save adders, and (3)
full adders to form a multiply unit, and in the use of a
plurality of such multiply units to form a pipeline active
filter, and not in the circuits per se. Thus, the novel
organization consists of a row of multipliers (a decoder,
shifters/inverters and carry-save adders), one multiplier
for each bit of the product, so interconnected that carry
propagation is not required in this preliminary multiplication,
and a single row of full adders, one for each
multiplier stage, which will propagate the carries necessary
for a temporary product, Wp, of the weight, W,
and pixel, p. The row of full adders is used again to form
the sum of the product with the accumulated sum ΣWp
from a previous multiply unit.

FIG. 7 shows five such multiply units which, as will
be appreciated, can conveniently be formed on a single
chip 30. Each multiply unit is represented by a rectangle
21 divided into two parts, a first part 21a comprised, as
shown in FIG. 5, of a row of shifters/inverters 11 with
carry-save adders 12, and sum and carry storage devices
13, 14, and a second part 21b comprised, as shown in
FIG. 5, of a row of full adders 15 with storage devices
16, 17 to store bits of temporary products Wp and accumulated
sum of products ΣWp. Each multiply unit
receives a 16-bit weight, W, and an 8-bit pixel, p, to
produce a truncated 22-bit product added to the accumulated
sum of products ΣWp from preceding multiply
units.

FIG. 8 shows seven such multiply chips 30 arranged 
to multiply each of a succession of pixels by a set of 
weights W1 through W35 prestored in the chips to form
a pipeline active filter 40. FIG. 9 shows how 35 such 
active filters 40 may be connected to the train of pixels 
through line delays 42 to form a filter 50 used to multi- 
ply the succession of pixels by a kernel of 35 by 35 
weights. Note that as the products Wp are formed, they 
are added to products of previous multiply units in each 
line of pixels, and that the accumulated sum of products 
for each line is added to the accumulated sum of prod- 
ucts of the next line so that the final pixels are weighted 
by the 35 by 35 kernel of predetermined weights. 
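
The way one-dimensional filters and line delays combine into a two-dimensional filter can be sketched on a raster-scanned pixel stream. The Python below is an illustrative model only; the function names, the small 3 by 3 kernel and the 5 by 6 test image are assumptions made for the example, and kernel row 0 is arbitrarily taken to act on the undelayed (most recent) line.

```python
def fir_1d(stream, weights):
    """One-dimensional filter: y[t] = sum over j of weights[n-1-j] * stream[t-j]."""
    n = len(weights)
    return [
        sum(weights[n - 1 - j] * stream[t - j] for j in range(n) if t - j >= 0)
        for t in range(len(stream))
    ]


def line_delay(stream, count):
    """Delay a stream by `count` samples, padding the start with zeros."""
    return ([0] * count + list(stream))[:len(stream)]


def filter_2d(stream, kernel, line_length):
    """Sum of one 1-D filter per kernel row, each fed through r line delays."""
    total = [0] * len(stream)
    for r, row_weights in enumerate(kernel):
        delayed = line_delay(stream, r * line_length)
        row_out = fir_1d(delayed, row_weights)
        total = [a + b for a, b in zip(total, row_out)]
    return total


ROWS, COLS = 5, 6
image = [[(3 * r + 7 * c + r * c) % 11 for c in range(COLS)] for r in range(ROWS)]
stream = [p for row in image for p in row]          # raster-scan order
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
filtered = filter_2d(stream, kernel, COLS)
print(filtered[2 * COLS + 2])   # first output whose 3 by 3 window lies inside the image
```

Outputs near the left edge or in the first rows mix pixels from neighboring lines or from the zero padding; only positions whose whole window lies inside the image are meaningful in this sketch.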

And finally, FIG. 10 illustrates how the 35 by 35 
pipeline active filter is connected to a data processor 52 
through a bus 54 and interface 55 to accept pixel data, 
multiply it by the 35 by 35 kernel, and return the filtered 
pixel data to the data processor through a bus interface 
unit 56. A buffer 57 is provided between the pipelined 
active filter 50 and the input bus interface 55 to store at
least a significant fraction of a frame of pixels for filtering.
A buffer 58 is similarly provided between the pipelined 
active filter and the return bus interface 56. Control 
lines through which the data processor supervises the 
system are indicated by dotted lines. The system is syn- 
chronized by a local clock generator 59, and the pipe- 
lined active filter itself is controlled by a sequence con- 
trol unit 60 for carrying out the necessary operations in 
the multiply units of the filter as will be more fully 
described with reference to FIGS. 5 and 6. Although 
shown in FIG. 10 as a single control unit for all chips 
having five multiply units, it is preferred to duplicate 
the sequence control unit on each chip as shown in FIG.
7 by a block 60 labeled control timing logic. The sequence
control function is implemented with a sequence
step counter. Distributed sequence control then minimizes
the number of input pins required for each chip.
With that overview, the structure and operation of one
multiplier for the i-th and adjacent stages of a multiply
unit will now be described with reference to FIGS. 5
and 6.

The i-th bit of the multiplicand (weight W) is prestored
in one stage of a shift register 10, of which only three
stages (W_{i-1}, W_i and W_{i+1}) are shown. Loading this
register is indicated to be from left to right, i.e., from the
less significant stage to the more significant stage. Once
loaded with the prescribed weight, processing of pixel
data may commence.

The 8-bit pixels, p, are received by a modified Booth
decoder 20 common to five multiply units, as shown in
FIG. 7, but shown in FIG. 5 as though dedicated to just
one multiply unit of which only three of twenty-two
stages are shown. The output of the decoder is a set of
four signals, namely 1X, 2X, ADD (+) and SUB (−),
as explained with reference to FIGS. 2a and 2b. Each
multiplier includes its own shifter/inverter (S/I) circuit
11 responsive to those four signals, as explained with
reference to FIGS. 2c and 2d. Each multiplier has two
weight bits as inputs, one from the stage i−1 of the
register 10 of less significance, and one from the stage i
of the register 10 of the same significance. The shifter/inverter
will select bit W_{i-1} or W_i, according to
whether 2X or 1X is true, and invert the bit according
to whether it is to be added or subtracted (complementing
and adding), which depends upon whether ADD or
SUB is true. The 2X and 1X control distributed from
the decoder 20 to each shifter/inverter is indicated by
circled 2X and 1X in the inputs to the shifter/inverter
of the i-th stage. The ADD and SUB control is similarly
distributed, though not otherwise indicated in FIG. 5,
to all other stages of the multiply unit. The selected bit
W_{i-1} or W_i is then added or subtracted (by addition of
the complement) in the associated carry-save adder 12
shown in FIG. 3.
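
The selection and inversion performed by a row of shifter/inverters can be sketched as follows. This Python is an illustration under stated assumptions, not the circuit of FIGS. 2c and 2d: the function names are hypothetical, and the sketch assumes two's-complement subtraction in which the extra 1 needed after complementing is supplied separately at the least significant position, a detail the text does not spell out.

```python
def shifter_inverter_row(weight, sub, one_x, two_x, width):
    """Bits presented to the carry-save adders, one per stage i (LSB first).

    Stage i selects W_i under 1X or W_{i-1} under 2X (a one-place shift),
    then inverts the selected bit when SUB is true.
    """
    bits = []
    for i in range(width):
        w_i = (weight >> i) & 1
        w_prev = (weight >> (i - 1)) & 1 if i > 0 else 0
        selected = (w_i & one_x) | (w_prev & two_x)
        bits.append(selected ^ sub)
    return bits


def row_value(bits, sub, width):
    """Value of the row modulo 2**width, adding the assumed +1 when subtracting."""
    raw = sum(bit << i for i, bit in enumerate(bits))
    return (raw + sub) % (1 << width)


WIDTH = 22
W = 0xABCD                                  # an arbitrary 16-bit weight
plus_two = row_value(shifter_inverter_row(W, 0, 0, 1, WIDTH), 0, WIDTH)
minus_one = row_value(shifter_inverter_row(W, 1, 1, 0, WIDTH), 1, WIDTH)
print(plus_two == 2 * W, minus_one == (-W) % (1 << WIDTH))
```

Because every stage uses the same four control signals, one decoder can drive the shifter/inverter rows of several multiply units at once, each row holding a different weight.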

The resulting sum bit, S, is temporarily stored in a
bistable device 13 shown with the next less significant
stage of the register 10 for convenience, and any carry
bit, C, is temporarily stored in a bistable circuit 14
shown directly below the carry-save adder of the i-th
stage. Note that the sum store device associated with
the more significant stage i+1 is shown below the i-th
stage, but that its output is an input not to the i-th carry-save
adder, but to the next less significant bit carry-save
adder, while the carry of the i-th stage is fed from bistable
circuit 14 directly to the next less significant bit
carry-save adder. All of this carry-save add and store
takes place during a first mode of operation indicated by
a circled 1 in the connecting lines of FIG. 5, and a
circled 1 in the timing diagram of FIG. 6.

This first mode lasts for five cycles of the system
clock in order to process all digit bits of a pixel. In the
sixth cycle of the clock, which initiates a second mode
indicated by a circled 2 in the connecting lines of FIG.
5, and a circled 2 in the timing diagram of FIG. 6, the
sum and carry of each multiplier stage are loaded into the
full adder 15 associated with the less significant stage.
Each half cycle of the clock is a step in the sequence
controlled by the control unit 60 (FIG. 10), keeping in
mind that the 8-bit pixel (multiplier in the modified
Booth algorithm) requires five passes to decode in
groups of three as follows:


Pass 1    0     P1    P2
Pass 2    P2    P3    P4
Pass 3    P4    P5    P6
Pass 4    P6    P7    P8
Pass 5    P8    0     0


In the first pass, the first two bits of the multiplier ap- 
pear in place for modified Booth decoding. Then shift- 
ing two bits at a time, it takes four more passes for all 
eight bits to be processed. The weights (multiplicands) 
are loaded before processing starts, and are held for the 
duration of the image frame processing, although 
weights could be altered during processing, if desired. 
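
The five passes and the add-shift accumulation can be sketched together. The Python below is an illustration with hypothetical function names; it uses exact integers in place of the stored sum and carry bits, and it pre-scales the weight so that no low-order bits are lost, whereas the hardware described here truncates the product to 22 bits.

```python
def booth_passes(pixel, pixel_bits=8):
    """Bit triples (Y_{i-1}, Y_i, Y_{i+1}) examined on each pass, first pass first."""
    padded = [0] + [(pixel >> k) & 1 for k in range(pixel_bits)] + [0, 0]
    return [tuple(padded[2 * k: 2 * k + 3]) for k in range(pixel_bits // 2 + 1)]


def signed_digit(y_prev, y_cur, y_next):
    """Multiple of the weight selected by one triple: -2, -1, 0, +1 or +2."""
    return y_prev + y_cur - 2 * y_next


def multiply_add_shift(weight, pixel, pixel_bits=8):
    """Add the selected multiple, then shift the running result two places
    toward the less significant end before the next pass."""
    passes = booth_passes(pixel, pixel_bits)
    scale = 2 * (len(passes) - 1)       # total number of places shifted
    acc = 0
    for n, triple in enumerate(passes):
        if n > 0:
            acc >>= 2                   # exact: acc is a multiple of 4 here
        acc += signed_digit(*triple) * (weight << scale)
    return acc


print(booth_passes(0b10110101))
print(multiply_add_shift(0x1234, 0b10110101) == 0x1234 * 0b10110101)
```

The fifth pass is needed because the pixel is treated as an unsigned number: when P8 is 1, the fourth pass subtracts, and the fifth pass (P8, 0, 0) adds the weight back at the next higher position.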

After five clock cycles, processing an 8-bit pixel
through the multiplier is complete. In order that the
results (saved sum and carry) can be added, the second
mode is initiated by the first half of the sixth clock period
at the beginning of mode 2 identified by a circled 2
in FIG. 6. The second mode is comprised of two phases,
an α phase which extends through the first clock pulse
period after the next start pulse, and a following β phase
which extends from the end of α for three clock cycles,
which is to say for more than half the period remaining
before the next start pulse. The α phase may be made as
long as necessary to await the start pulse when the next
pixel is available. In practice, this hold period may be
reduced to zero for very fast VLSI circuits.

As just noted there are two phases in this second
mode. The operations to be completed in the first phase
are identified by a circled α in the lines connecting the
sum and carry of each multiplier stage to the full adder
15 of the next less significant stage, and connecting the
sum output of the full adder to a device 16 for storing a
temporary product, Wp. This first phase operation is
identified in FIG. 6 also by a circled α. Note that only
the carry from the next more significant stage is stored
in an input storage device of the adder. This occurs at
the center of the sixth clock pulse, as indicated by the
waveform labeled “load full adder” in FIG. 6. While
both the sum and carry bits could be stored, in practice
it is not necessary to store either one since addition
begins immediately, and the temporary product Wp is
loaded into the storage device 16 as indicated by the
waveform labeled “load Wp into 16” in FIG. 6. The
reason the carry is loaded into the input storage device
is only because it is necessary to later store the sum of
products bit, ΣWp, in storage device 17, and that storage
device is used to multiplex between first the carry of
the next more significant stage, and then the sum of
products, ΣWp.

The carry and sum are added in each full adder during
the α phase of the second mode to the temporary
product, Wp, from device 16, and the sum, Wp, is
stored in the device 16. All of this takes place in the
period between the middle of the first clock pulse cycle
of mode 2, to the end of one clock pulse cycle after the
next start pulse. This is sufficient time to propagate the
carries in forming the temporary product. Thus, the
operation taking place during a long pulse period labeled
“load Wp into 16” is to perform the first addition
in the full adder. The addition is of the sum and carry
bits saved during mode 1. The temporary product, Wp,
is indicated as an input to the full adder 15 by a circled
β0 in the connection shown in FIG. 5. The next start
pulse loads the next pixel into the decoder 20 for processing,
and the mode 1 sequence is repeated for the
next pixel. Meantime, the temporary product, Wp, is
“shifted” (gated) into the full adder as an input labeled
β0 in FIG. 5. This occurs in the middle of the second
clock pulse period after a start pulse. At the end of that
second pulse period, the preceding sum of products
ΣWp from storage device 17 is loaded as input β1 into
the input storage of the adder to start the second addition
of β0 to β1. The output indicated as β2 in FIG. 5 is
then held in storage device 17. Because a new sum of
products is being stored (from a preceding multiply
unit), it is necessary to store the old sum of products
(input β1 to the full adder) in the input storage device of
the full adder to free the storage device 17 for the new
sum of products (input β2 from the preceding multiply
unit).

Thus, while the next pixel is being processed through
five passes in the carry-save adder, the full adder adds
the temporary product stored in device 16 to the old
sum of products ΣWp previously stored in a storage
device 17 to form a new sum of products ΣWp. The old
sum of products is identified by a circled β1 in the connection
from the storage device 17 to the full adder 15,
and the new one is identified by a circled β2 at the output of
the full adder 15, which forms the sum β0 + β1 → β2,
where β0 is the temporary product sum Wp formed by
adding the bits transferred to the full adder, and from
the full adder to the storage device 16 during the α
phase of the second mode. The new sum of this multiply
unit is then stored in the device 17 of the next multiply
unit in succession. While reference has been made to
only one stage, it should be understood that in this exemplary
embodiment there are 22 stages in each multiply
unit operating in parallel.

In summary, the first mode forms the sum and carry
bits of a new temporary product, Wp, while a second
mode forms the temporary product and adds it to the
accumulated sum of products, ΣWp, from a preceding
multiply unit. These two modes overlap from one pixel
to the next, which is to say that while the second mode
for one pixel is being completed, the first mode for the
next pixel is started and completed. The second mode
has two phases. During a first α phase, the product
sum produced by the multiplying unit is formed and
saved in the storage device 16. Operations of this first
phase are identified in FIGS. 5 and 6 by a circled α;
they add the sum and carry of the next more significant
carry-save adder, thus forming a temporary product
sum, identified in FIGS. 5 and 6 by a circled P0. A
second β phase forms a new product sum ΣWp,
identified in FIG. 5 by a circled P2, from the addition
of the temporary product Wp to an old sum of products
ΣWp stored in the device 17 from the preceding
multiplying unit, and identified in FIG. 5 by a
circled P1.
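
As an illustrative aside, not part of the patent text, the division of labor between the two modes can be sketched in Python. The function names are hypothetical. The sketch assumes that Python's unbounded integers may stand in for the 22 stages, and that each partial product is aligned by shifting the weight to more significant positions rather than by shifting the accumulated sum and carry to less significant positions as the hardware does; clocking and the storage devices are not modeled.

```python
def carry_save_add(s, c, addend):
    """One carry-save step: three words in, a sum word and a carry word out.

    No carry ripples between bit positions; each carry bit is only
    moved up one position, to be absorbed by a later step.
    """
    new_sum = s ^ c ^ addend
    new_carry = ((s & c) | (s & addend) | (c & addend)) << 1
    return new_sum, new_carry


def multiply_carry_save(weight, booth_digits):
    """Form weight * pixel from modified Booth digits (-2..+2).

    First mode: accumulate the partial products as a (sum, carry) pair.
    Second mode, first phase: a single carry-propagate addition of the
    final sum and carry words gives the temporary product.
    """
    s, c = 0, 0
    for k, d in enumerate(booth_digits):
        s, c = carry_save_add(s, c, (d * weight) << (2 * k))
    return s + c  # the only addition with carry propagation


def accumulate(temp_product, old_sum):
    """Second mode, second phase: temporary product plus old sum of
    products gives the new sum of products."""
    return temp_product + old_sum
```

For example, the digit list [-2, 2, -1, -1, 1] represents the pixel value 182 (least significant digit first), so multiply_carry_save(37, [-2, 2, -1, -1, 1]) returns 6734.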

The description of FIG. 5 began with reference to
just the i-th stage of a 22 bit multiply unit, it being
understood that, in this exemplary embodiment, there are
five 22 bit multiply units on a chip operating with a
single multiplier decoder 20, as shown in FIG. 7. But in
the end it became convenient to speak of the sum ΣWp
of the products of the entire multiply unit as formed by
the array of full adders (stages) which have carry
propagation between them. Since this full adder is used in
this way while the next pixel is being processed, there is
sufficient time for the full adder to propagate the
carries, first in forming the new temporary product Wp
during the indefinite hold period shown in FIG. 6, and
then in forming the new sum of products ΣWp shown as
P2 going out of the timing diagram at the lower right of
FIG. 6. That sum of products ΣWp is introduced into
the next pipelined multiply unit shown in FIG. 7, as
indicated by the circled P2 in the connecting lines in
FIG. 5. This is all under the sequence control of the
control unit 60 (FIG. 10), which generates the signals
shown in the timing diagram of FIG. 6 to effect the
control described with reference to FIGS. 5 and 6. The
sequence of steps, one step for each half cycle of the
clock, is as follows:

FIRST MODE — (initiated by START pulse) 

1. 8-bit pixel loaded into decoder register (not shown
in FIGS. 2a and 2b). Enter P0 (=0), P1, P2 in decoder
proper. Multiply by adding or subtracting 1x or 2x
weight bits as determined by decoder, and latch decoder
output.

2. Prior carry and sum blocked, and zero inserted into
carry-save adder. Generate sum and carry and store in
devices 13 and 14. Enter P2, P3 and P4 in decoder
proper.

3. P2, P3, P4 shifted into decoder proper. Shift carry
and sum of carry-save adders.

4. Multiply by adding or subtracting 1x or 2x
weight bits as determined by decoder and add prior
carry from next more significant stage and sum from
second more significant stage.

5 and 6. Repeat steps 3 and 4 with P4, P5 and P6.

7 and 8. Repeat steps 3 and 4 with P6, P7 and P8.

9 and 10. Repeat steps 3 and 4 with P8, P9 (=0) and
P10 (=0).
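
As an illustrative aside, not part of the patent text, the sequence of overlapping bit triplets in the steps above can be sketched in Python. The names booth_digits and booth_multiply are hypothetical. The sketch assumes that the first bit after the appended zero is the least significant pixel bit, that the pixel is an unsigned 8-bit value, and that each triplet is decoded by the usual modified Booth rule (low bit plus middle bit minus twice the high bit); it models only the arithmetic, not the stages or the half-cycle timing.

```python
def booth_digits(pixel, nbits=8):
    """Recode an unsigned pixel into radix-4 modified Booth digits.

    p[0] is the appended low-order zero, p[1] is assumed to be the
    least significant pixel bit, and the two bits above the pixel
    are zero, as in the last triplet of the step list.
    """
    if not 0 <= pixel < (1 << nbits):
        raise ValueError("pixel out of range")
    p = [0] + [(pixel >> k) & 1 for k in range(nbits)] + [0, 0]
    digits = []
    # Triplets (p0,p1,p2), (p2,p3,p4), ... overlap by one bit.
    for lo in range(0, nbits + 1, 2):
        digits.append(p[lo] + p[lo + 1] - 2 * p[lo + 2])
    return digits  # each digit is one of -2, -1, 0, +1, +2


def booth_multiply(weight, pixel, nbits=8):
    """Form weight * pixel as a sum of 0, +/-1x or +/-2x the weight."""
    product = 0
    for k, d in enumerate(booth_digits(pixel, nbits)):
        product += (d * weight) << (2 * k)  # pass k is worth 4**k
    return product
```

An 8-bit pixel yields five digits, one per pass; for instance, booth_digits(182) gives [-2, 2, -1, -1, 1], and -2 + 2(4) - 16 - 64 + 256 = 182.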

SECOND MODE 

12 through 2'. Complete α connections in FIG. 5 for
storage of sum and carry from carry-save adder of next
more significant stage. Hold through next START
pulse. Generate temporary product sum by adding sum
and carry from carry-save adder of next more significant
stage, with full propagation of carries, and store
temporary product sum in device 16.

1'. START pulse generated to commence multiplica- 
tion of next pixel by steps 1-10 above. 

2'. Continue to hold temporary product from step 12. 

3'. Enter temporary product from storage device 16 
to full adder. 

4' through 8'. Generate the new sum of products
ΣWp by adding content of device 16 to content of
device 17, after first transferring content of device 17 to
input storage of full adder 15, to be added to the
temporary product sum in device 16 to output a new sum
of products ΣWp from the full adder.

9'. Hold new sum of products, and continue to hold 
until the second clock pulse after the next START 
pulse. 
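
As an illustrative aside, not part of the patent text, the way a chain of multiply units builds up a sum of products can be sketched in Python. The name pipeline_filter and the list named held are hypothetical. The sketch assumes that every unit in the chain receives the same pixel in the same pixel period (one shared decoder), that each unit holds one weight, that the first unit receives a zero sum, and that each new sum moves to the next unit once per pixel period; the half-cycle steps listed above are not modeled.

```python
def pipeline_filter(pixels, weights):
    """Model a chain of multiply units, one weight per unit.

    held[k] plays the role of the storage device that keeps the old
    sum of products handed to unit k by the preceding unit during
    the previous pixel period.
    """
    held = [0] * len(weights)
    outputs = []
    for p in pixels:
        # Every unit multiplies the same pixel by its own weight and
        # adds the sum received from the preceding unit.
        new_sums = [w * p + old for w, old in zip(weights, held)]
        outputs.append(new_sums[-1])   # sum leaving the last unit
        held = [0] + new_sums[:-1]     # pass each sum to the next unit
    return outputs
```

With weights [10, 1] and pixels [1, 2, 3, 4], the last unit delivers [1, 12, 23, 34]: each output is the current pixel times the last weight plus the previous pixel times the first weight.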

Although particular embodiments of the invention 
have been described and illustrated herein, it is recog- 
nized that modifications and variations may readily 
occur to those skilled in the art. Consequently, it is 
intended that the claims be interpreted to cover such 
modifications and variations. 

What is claimed is: 

1. A pipeline active filter wherein a succession of 
n-bit data elements are processed sequentially, and each 
n-bit element need only be accessed once to be multi- 
plied by a number of weights, comprising: 

a plurality of multiply units having a modified Booth 
decoder means for decoding three bits of each 
element at a time in increasing order of significance 
with one bit overlap, and generating commands to 
add or subtract the weight, or zero, and to multiply
the weight by 1 or 2 when it is added or subtracted, 
and a plurality of multiplier stages, one multiplier 
stage for each bit of the weight, each multiplier 
stage comprising;

inverter/shifter means for shifting or not shifting by 
one bit position the bit of the weight to be added or 
subtracted, thereby multiplying the weight by one 
or by two, and inverting or not inverting the bit of 
the weight, thereby adding the weight or forming
the complement for subtraction of the weight by 
addition, 

a carry-save means for adding or subtracting zero or 
the weight multiplied by one or the weight multiplied
by two without propagation of carries, all in
response to said modified Booth decoder means, 
each time shifting the sum of partial products rela- 
tive to the weight so that the weight is added to 
successively higher orders of the partial products 
being accumulated without propagation of carries,

storage means for the sum and carry bits of each 
carry-save means for transfer of the sum bit to 
second less significant carry-save means, and of the 
carry bit to the next less significant carry-save 
means during multiplication of successive bits of
the element by addition of the weight, and for 
storing the sum and carry bits of the last partial 
product generated after multiplication by the last 
bit of the element, and 

a full-adder means for adding said final sum and carry
of the next more significant stage to form a tempo- 
rary product, and thereafter for adding said tempo- 
rary product to an old sum of products from a 
preceding multiply unit to form a new sum of products
as an output to the next multiply unit.

2. A pipelined active filter as defined in claim 1, in- 
cluding a number of sets of multiply units connected in 
cascade, each set with separate modified Booth decoder 
means for decoding pixels in sequence, and including 
line delay means between decoders, whereby the sum of
products of said data elements multiplied by said 
weights in said sets of multiply units to multiply each of 
said elements by a kernel of weights, said kernel consisting
of a number of weights equal to the number of multiply
units in a set and an additional number of weights
equal to the number of sets. 

3. A pipeline active filter for image data processing 
having a number M×N of like multiply units connected
in cascade in sets with the sum of products out of one set
connected as an input to the next set for producing a
sum of M×N image elements multiplied by M×N
weights, where M is the number of image elements in a 
line of N lines of a two dimensional image, each of said 
multiplying units for a given weight comprising: 

means for storing a multiplier W having a plurality of
bits, 

means for accessing each multiplicand in sequence for 
multiplication by said multiplier, 

modified Booth decoder means for determining the 
operation to be performed for multiplication of
successive bits of said multiplicand by multiplier 
bits, and 

multiplying means responsive to said modified Booth 
decoder means and said multiplier storing means, 
for forming the product of said multiplicand and
said multiplier as a sum of partial products without 
propagation of carries until a final sum is formed as 
the product, said multiplying means comprising; 


a row of carry-save adders, one adder for each bit W_i
of said multiplier for forming the sum and carry 
bits of bit multiplication without propagation and 
addition of carries between adders, but with shift- 
ing of sum and carry bits of each partial product 
relative to said bits of said multiplier, 
a shifter/inverter for each carry-save adder respon- 
sive to said modified Booth encoder for adding or 
subtracting by adding the complement of a bit W_i
or W_{i-1} in said carry-save adder in forming a sum
and carry, 

means for temporarily storing said sum and carry bits 
from carry-save adders between steps of successive 
processing of multiplicand bits, 
a row of full adders, one full adder for each multipli- 
cand bit, adding all said sum and carries stored in 
said temporary storing means after the last of said 
multiplicand bits have been processed through said 
modified Booth decoder, whereby a product is 
formed with a row of carry-save adders, one per 
multiplicand bit, without the need for carry propa- 
gation in forming partial products, 
means for storing said product temporarily, 
control means for adding a temporary product stored 
in said temporary product storage means with a 
sum of products from a preceding multiply unit of 
a set, or preceding set, to form a new sum of prod- 
ucts input to the next succeeding multiply unit of a 
set, or the next set, and 

means for delaying successive elements by one line 
between groups of M multiplying units set to form 
a two-dimensional convolution of elements. 

4. A pipeline active filter for image data processing as 
defined in claim 3 wherein said multiplicand is com- 
prised of m bits, and said modified Booth decoder sam- 
ples and decodes three multiplicand bits at a time, start- 
ing with 0 and the first two multiplicand bits, and there- 
after shifting two bit positions and examining three bits 
during each successive multiply cycle for n cycles, 
where n is a minimum number of cycles necessary for 
said modified Booth decoder to thus process all multi- 
plicand bits and said carry-save adder to perform carry- 
save addition operations, and wherein said full adder 
performs two carry propagate additions following each 
of n cycles, whereby all sum and carry bits for multipli- 
cation of the next element by a weight are formed and 
temporarily stored in n cycles, while a product is 
formed by addition of previous partial product sum and 
carry bits and said product thus formed is added to a 
previous sum of products to form a new sum of prod- 
ucts. 

5. In a multiplier wherein multiplicands are processed 
sequentially, each multiplicand having a plurality of 
bits, the combination of: 

means for storing a multiplier W having a plurality of 
bits, 

means for accessing each multiplicand in sequence for 
multiplication by said multiplier, 
modified Booth decoder means for determining the 
operation to be performed for multiplication of 
successive bits of said multiplicand by multiplier 
bits, and 

multiplying means responsive to said modified Booth 
decoder means and said multiplier storing means 
for forming sum and carry bits of partial products 
of said multiplicand and said multiplier without 
propagation of carries until a final addition of said 





sum and carry bits is formed as the product, said 
multiplying means comprising; 
a row of carry-save adders, one adder for each bit 
W_i of said multiplier for forming the sum and
carry bits of bit multiplication without propagation
and addition of carries between adders, but
with shifting of said sum and carry bits relative 
to said bits of said multiplier, 
a shifter/inverter for each carry-save adder responsive
to said modified Booth encoder for adding
or subtracting by adding the complement of a bit
W_i or W_{i-1} in said carry-save adder in forming a
sum and carry, 

means for temporarily storing said sum and carry 
bits from carry-save adders between steps of
successive processing of multiplicand bits, and 
a row of full adders, one full adder for each multi- 
plicand bit, adding all said sum and carries stored 
in said temporary storing means after the last of 
said multiplicand bits have been processed
through said modified Booth decoder, whereby 
a product is formed with a row of carry-save 
adders sum of partial products, without the need 
for carry propagation in forming partial products.

6. The combination of claim 5 including means for 
storing temporarily said product produced by said full 
adder, and means for temporarily storing a previous 
sum of products, and including control means for using
said full adder to form a new sum of products by adding 

a product just formed to said previous sum of products 
while said multiplying means processes bits of a succeeding
multiplicand, whereby a plurality of multiplicands
may be multiplied by a multiplier to form a sum of
products. 

7. A pipeline processor for image elements having a
number of like multiplying units connected in cascade in 
sets with the sum of products out of one set connected 

as an input to the next set, and including means for
delaying successive lines of image elements applied as 
multiplicands to said multiplying units in sets by the 
number of elements in an image row, each multiply unit 
having means for accessing each element only once for 
a set of weights by each of successive elements, said
multiplying means for each weight comprising: 
means for storing a multiplier W having a plurality of 
bits, 

means for accessing each multiplicand in sequence for 
multiplication by said multiplier,

modified Booth decoder means for determining the 
operation to be performed for multiplication of 
successive bits of said multiplicand by multiplier 
bits, and 


multiplying means responsive to said modified Booth 
decoder means and said multiplier storing means 
for forming partial products of said multiplicand 
and said multiplier without propagation of carries 
until a final addition is performed as the product, 
said multiplying means comprising; 
a row of carry-save adders, one adder for each bit W_i
of said multiplier for forming the sum and carry 
bits of bit multiplication without propagation and 
addition of carries between adders, but with shift- 
ing of said sum and carry bits relative to said bits of 
said multiplier, 

a shifter/inverter for each carry-save adder respon- 
sive to said modified Booth encoder for adding or 
subtracting by adding the complement of a bit W_i
or W_{i-1} in said carry-save adder in forming a sum
and carry, 

means for temporarily storing said sum and carry bits 
from carry-save adders between steps of successive 
processing of multiplicand bits, 
a row of full adders, one full adder for each multipli- 
cand bit, adding all said sum and carries stored in 
said temporary storing means after the last of said 
multiplicand bits have been processed through said 
modified Booth decoder, whereby a product is 
formed with just one row of carry-save adders sum 
of partial products, without the need for carry 
propagation in forming partial products, 
means for storing said product temporarily, and 
control means for adding a temporary product stored 
in said temporary product storage means with a 
sum of products from a preceding multiply unit of 
a set, or preceding set, to form a new sum of prod- 
ucts input to the next succeeding multiply unit of a 
set, or the next set. 

8. A pipeline processor as defined in claim 7 wherein 
said multiplicand is comprised of m bits, and said modi- 
fied Booth decoder samples and decodes three multipli- 
cand bits at a time, starting with 0 and the first two 
multiplicand bits, and thereafter shifting two bit posi- 
tions and examining three bits during each successive 
multiply cycle for n cycles, where n is a minimum num- 
ber of cycles necessary for said modified Booth decoder 
to process all multiplicand bits and said carry-save 
adder to perform carry-save addition operations, and 
wherein said full adder performs the first of said two 
carry propagate additions following each of n cycles, 
whereby all sum and carry bits for a multiplication are 
formed and temporarily stored in n cycles, while a sum 
of previous sum and carry bits are added to form a 
temporary product and a temporary product is added to 
a previous sum of products to form a new sum of prod- 
ucts with propagation of carries. 

***** 